This document contains the exploratory data analysis, model building and analysis for Coasting on Couches, the term project for MGT 6203 (Spring semester). While the project report and slides/presentation offer a more readable narrative, this document combines exposition with the code used to generate the graphs.
The document will be shared as a Shiny app and in raw R Markdown (.Rmd) format.
Please run the following chunk to ensure all the necessary libraries are installed should you wish to execute the chunks yourself. (Idea taken from this blogpost)
Please execute install.packages("pls") in the console the first time you run this. Some additional installations may be required, depending on your operating system.
We downloaded data from InsideAirbnb.com to the data folder. This will be available on our GitHub repo.
The data is in five parts (each city has all five elements): 1. listing: the actual Airbnb listings; a set of 74 variables pertaining to each listing. The columns are defined here. 2. review: one review per row, linked to the listing table. 3. calendar: the price of a particular listing on a particular date, along with the minimum and maximum nights for hire. 4. neighbourhoods: the list of neighbourhoods screened in the city. 5. map: a GeoJSON shapefile showing the district boundaries.
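Since each city's files follow a fixed naming convention (SIN_listings.csv, SIN_reviews.csv, and so on), the per-city paths can be built with a small helper. This is a sketch of our own; the function name city_data_paths is not part of the project code:

```r
# Build the five expected file paths for one city code, e.g. "SIN" or "TPE".
# Returns a named character vector, so callers can do read.csv(paths[["listings"]]).
city_data_paths <- function(city_code, data_dir = "./data") {
  parts <- c(listings = "listings.csv",
             reviews = "reviews.csv",
             calendar = "calendar.csv",
             neighbourhoods = "neighbourhoods.csv",
             map = "neighbourhoods.geojson")
  setNames(file.path(data_dir, paste0(city_code, "_", parts)), names(parts))
}

city_data_paths("SIN")[["listings"]]
# "./data/SIN_listings.csv"
```

The same helper works for all four cities, which would keep the read chunk below free of copy-paste mistakes.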
We will read all the data into dataset variables.
#Singapore
listing.sin <- read.csv("./data/SIN_listings.csv")
reviews.sin <- read.csv("./data/SIN_reviews.csv")
calendar.sin <- read.csv("./data/SIN_calendar.csv")
neighbourhoods.sin <- read.csv("./data/SIN_neighbourhoods.csv")
map.sin <- geojson_read("./data/SIN_neighbourhoods.geojson")
#Taipei
listing.tpe <- read.csv("./data/TPE_listings.csv")
reviews.tpe <- read.csv("./data/TPE_reviews.csv")
calendar.tpe <- read.csv("./data/TPE_calendar.csv")
neighbourhoods.tpe <- read.csv("./data/TPE_neighbourhoods.csv")
map.tpe <- geojson_read("./data/TPE_neighbourhoods.geojson")
#Tokyo
listing.nrt <- read.csv("./data/NRT_listings.csv")
reviews.nrt <- read.csv("./data/NRT_reviews.csv")
calendar.nrt <- read.csv("./data/NRT_calendar.csv")
neighbourhoods.nrt <- read.csv("./data/NRT_neighbourhoods.csv")
map.nrt <- geojson_read("./data/NRT_neighbourhoods.geojson")
#Hong Kong
listing.hkg <- read.csv("./data/HKG_listings.csv")
reviews.hkg <- read.csv("./data/HKG_reviews.csv")
calendar.hkg <- read.csv("./data/HKG_calendar.csv")
neighbourhoods.hkg <- read.csv("./data/HKG_neighbourhoods.csv")
map.hkg <- geojson_read("./data/HKG_neighbourhoods.geojson")
We first see the number of listings per city.
cities <- c("Singapore", "Tokyo", "Taipei", "Hong Kong")
no_of_listings <- c(nrow(listing.sin), nrow(listing.nrt), nrow(listing.tpe), nrow(listing.hkg))
no_of_listings.fig <- plot_ly(
x = cities,
y = no_of_listings,
type = "bar",
text = no_of_listings
)
no_of_listings.fig <- no_of_listings.fig %>% layout(title ="No of Listings Per City", yaxis = list(title="No of Listings"))
no_of_listings.fig
Clearly, Tokyo has the largest number of listings followed by Hong Kong, Taipei and then Singapore.
Let us also consider heatmaps of where the listings are in each city, by neighbourhood. Admittedly, the InsideAirbnb website has a map for each city (here’s one for Taipei), but it does not break the listings down by district. At a later stage, this could be enhanced to show a chosen variable or a time series.
generate_choropleth_by_city <- function (listing, map, city_name)
{
listings_by_neighbourhood <- listing %>%
count(neighbourhood_cleansed)
# neighbourhoods_zero <- neighbourhoods %>%
# filter(!neighbourhood %in% listings_by_neighbourhood$neighbourhood) %>%
# mutate(n = 0) %>%
# select(neighbourhood, n)
# listings_by_neighbourhood <- union(listings_by_neighbourhood, neighbourhoods_zero)
# print(listings_by_neighbourhood)
g <- list (
fitbounds = "locations",
visible = FALSE
)
fig <- plot_ly()
fig <- fig %>% add_trace(
type="choropleth",
geojson=map,
locations=listings_by_neighbourhood$neighbourhood_cleansed,
z=listings_by_neighbourhood$n,
colorscale="jet",
featureidkey="properties.neighbourhood"
)
fig <- fig %>% layout(
geo = g
)
fig <- fig %>% colorbar(title = "No of listings")
fig <- fig %>% layout(
title = paste0("Listings by Neighbourhood - ", city_name)
)
fig
}
generate_choropleth_by_city(listing.sin, map.sin, "Singapore")
generate_choropleth_by_city(listing.nrt, map.nrt, "Tokyo")
generate_choropleth_by_city(listing.hkg, map.hkg, "Hong Kong")
generate_choropleth_by_city(listing.tpe, map.tpe, "Taipei")
bin_districts <- function(listing, bins)
{
district_bins <- listing %>%
count(neighbourhood_cleansed) %>%
arrange(desc(n)) %>%
mutate(nb_group = ntile(n, n = bins)) %>%
arrange(desc(nb_group))
return(district_bins)
}
bin_districts(listing.sin, 4)
## neighbourhood_cleansed n nb_group
## 1 Kallang 405 4
## 2 Geylang 340 4
## 3 Downtown Core 317 4
## 4 Outram 309 4
## 5 Rochor 270 4
## 6 Novena 235 4
## 7 Bedok 186 4
## 8 Bukit Merah 178 4
## 9 River Valley 149 4
## 10 Queenstown 143 4
## 11 Singapore River 108 4
## 12 Tanglin 90 3
## 13 Orchard 80 3
## 14 Clementi 72 3
## 15 Jurong East 71 3
## 16 Jurong West 59 3
## 17 Marine Parade 57 3
## 18 Newton 57 3
## 19 Bukit Timah 56 3
## 20 Woodlands 44 3
## 21 Hougang 43 3
## 22 Toa Payoh 40 3
## 23 Bishan 39 2
## 24 Serangoon 35 2
## 25 Pasir Ris 33 2
## 26 Bukit Batok 31 2
## 27 Tampines 30 2
## 28 Ang Mo Kio 26 2
## 29 Sembawang 26 2
## 30 Sengkang 19 2
## 31 Yishun 19 2
## 32 Southern Islands 18 2
## 33 Punggol 17 2
## 34 Choa Chu Kang 17 1
## 35 Museum 15 1
## 36 Bukit Panjang 14 1
## 37 Central Water Catchment 12 1
## 38 Marina South 3 1
## 39 Western Water Catchment 3 1
## 40 Mandai 2 1
## 41 Lim Chu Kang 1 1
## 42 Pioneer 1 1
## 43 Sungei Kadut 1 1
## 44 Tuas 1 1
We can also view this as bar charts.
bar_charts_by_neighbourhood <- function (listing, city_name, neighbourhoods)
{
listings_by_neighbourhood <- listing %>%
count(neighbourhood_cleansed) %>%
# rename(neighbourhood = neighbourhood_cleansed)
arrange(desc(n))
# print(listings_by_neighbourhood)
neighbourhoods_zero <- neighbourhoods %>%
filter(!neighbourhood %in% listings_by_neighbourhood$neighbourhood_cleansed) %>%
rename(neighbourhood_cleansed = neighbourhood) %>%
mutate(n = 0) %>%
select(neighbourhood_cleansed, n)
print(neighbourhoods_zero)
listings_by_neighbourhood <- union(listings_by_neighbourhood, neighbourhoods_zero)
# print(listings_by_neighbourhood)
fig <- plot_ly(y=listings_by_neighbourhood$neighbourhood_cleansed, x=listings_by_neighbourhood$n, type="bar", orientation="h") %>%
layout(yaxis=list(categoryorder = "total ascending"), title=paste("Listings per neighbourhood in", city_name))
fig
}
bar_charts_by_neighbourhood(listing.sin, "Singapore", neighbourhoods.sin)
## neighbourhood_cleansed n
## 1 Marina East 0
## 2 Straits View 0
## 3 Changi 0
## 4 Changi Bay 0
## 5 Paya Lebar 0
## 6 North-Eastern Islands 0
## 7 Seletar 0
## 8 Simpang 0
## 9 Boon Lay 0
## 10 Tengah 0
## 11 Western Islands 0
bar_charts_by_neighbourhood(listing.nrt, "Tokyo", neighbourhoods.nrt)
## neighbourhood_cleansed n
## 1 Aogashima Mura 0
## 2 Fussa Shi 0
## 3 Hachijo Machi 0
## 4 Higashiyamato Shi 0
## 5 Hinode Machi 0
## 6 Hinohara Mura 0
## 7 Inagi Shi 0
## 8 Kiyose Shi 0
## 9 Kozushima Mura 0
## 10 Mikurajima Mura 0
## 11 Miyake Mura 0
## 12 Mizuho Machi 0
## 13 Niijima Mura 0
## 14 Ogasawara Mura 0
## 15 Oshima Machi 0
## 16 Toshima Mura 0
bar_charts_by_neighbourhood(listing.hkg, "Hong Kong", neighbourhoods.hkg)
## [1] neighbourhood_cleansed n
## <0 rows> (or 0-length row.names)
bar_charts_by_neighbourhood(listing.tpe, "Taipei", neighbourhoods.tpe)
## [1] neighbourhood_cleansed n
## <0 rows> (or 0-length row.names)
The majority of listings in Taipei, Tokyo and Hong Kong are in tourist-heavy districts. The top district in Tokyo is Shinjuku, while that in Hong Kong is Yau Tsim Mong, Kowloon’s core urban area formed by the combination of Yau Ma Tei, Tsim Sha Tsui and Mong Kok. Taipei’s top district is Zhongzheng District (“中正區”), home to historic sites and cultural performances.
In contrast to the other three cities, the top district in Singapore is Kallang, a high-end residential condo district: not tourist-heavy, but close to the downtown’s many attractions. The tourist-heavy Geylang, Downtown Core and Outram districts follow Kallang.
Taipei and Hong Kong have listings in all districts, but 11 districts in Singapore (mostly military installations or the airport) and 16 districts in Tokyo have no listings at all.
There’s a column called amenities in the dataset that lists all the self-reported amenities of a listing as a single comma-separated string. Let’s explore it further.
For instance, here’s the longest list of amenities among Singapore listings.
listing_amenities.sin <- listing.sin %>%
mutate(amenities = str_replace(amenities, "\\[","")) %>%
mutate(amenities = str_replace(amenities, "\\]","")) %>%
mutate(amenities = str_replace_all(amenities, "\"","")) %>%
mutate(amenities = str_replace_all(amenities, ", " ,",")) %>%
mutate(amenities_list = as.list(strsplit(amenities, ","))) %>%
mutate(no_of_am = lengths(amenities_list)) %>%
mutate(Wifi = as.numeric(grepl('Wifi', amenities, fixed = TRUE))) %>%
mutate(Shampoo = as.numeric(grepl('Shampoo', amenities, fixed = TRUE))) %>%
mutate(Kitchen = as.numeric(grepl('Kitchen', amenities, fixed = TRUE)))
# listing_amenities.sin %>% select(amenities, Wifi, Shampoo, Kitchen, Patio)
max_amenities.sin <- listing_amenities.sin %>%
select(amenities, no_of_am) %>%
group_by() %>%
slice(which.max(no_of_am))
amenities_list_string <- as.list(strsplit(as.character(max_amenities.sin["amenities"]), ","))
amenities_list_string
## [[1]]
## [1] "Toaster" "Sound system"
## [3] "Safe" "Indoor fireplace"
## [5] "Backyard" "Hangers"
## [7] "Bed linens" "Hot water kettle"
## [9] "Freezer" "Coffee maker"
## [11] "Washer" "Cooking basics"
## [13] "Bathtub" "Hair dryer"
## [15] "Clothing storage" "Oven"
## [17] "Outdoor furniture" "Paid parking on premises"
## [19] "High chair" "Children\\u2019s books and toys"
## [21] "Dedicated workspace" "Crib"
## [23] "Dining table" "Free parking on premises"
## [25] "Pool" "Cleaning products"
## [27] "Wine glasses" "Cleaning before checkout"
## [29] "Game console" "Long term stays allowed"
## [31] "Drying rack for clothing" "Outdoor dining area"
## [33] "Private entrance" "Elevator"
## [35] "Patio or balcony" "Refrigerator"
## [37] "Dryer" "Microwave"
## [39] "Baby bath" "Gym"
## [41] "Wifi" "Children\\u2019s dinnerware"
## [43] "Smoke alarm" "Board games"
## [45] "Luggage dropoff allowed" "Shampoo"
## [47] "Breakfast" "Extra pillows and blankets"
## [49] "Heating" "Conditioner"
## [51] "Cable TV" "Hot tub"
## [53] "Hot water" "Stove"
## [55] "Body soap" "BBQ grill"
## [57] "Iron" "Essentials"
## [59] "Babysitter recommendations" "Pack \\u2019n play/Travel crib"
## [61] "Kitchen" "Changing table"
## [63] "First aid kit" "Dishes and silverware"
## [65] "Air conditioning" "Shower gel"
## [67] "TV with standard cable" "Fire extinguisher"
#"Shampoo,Kitchen,Long term stays allowed,Washer,Smart lock,Hair dryer,Dryer,Wifi,Hot water,TV,Air conditioning,Smoke alarm,Fire extinguisher"
Somewhat arbitrarily, we will use the following amenities as dummy variables for price: > “Shampoo,Kitchen,Long term stays allowed,Washer,Hair dryer,Wifi,Hot water,TV,Air conditioning”
Similarly, let us analyse the host_verifications column to see if we can generate dummy variables from it as well.
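The dummy variables are simple substring indicators: a fixed-string grepl on the cleaned amenities string yields a 0/1 flag per amenity. A minimal sketch (the sample string and the has_amenity helper are ours, for illustration):

```r
# 1 if the amenity keyword occurs in the comma-separated amenities string, else 0
has_amenity <- function(amenities, keyword) {
  as.numeric(grepl(keyword, amenities, fixed = TRUE))
}

amenities <- "Shampoo,Kitchen,Long term stays allowed,Wifi,Air conditioning"
has_amenity(amenities, "Wifi")    # 1
has_amenity(amenities, "Washer")  # 0
```

One caveat of fixed substring matching: broad keywords over-match, e.g. "TV" also fires for "TV with standard cable", and "Hot water" for "Hot water kettle". For the amenities chosen here that behaviour is arguably what we want.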
listing_host_verf.sin <- listing.sin %>%
mutate(host_verifications = str_replace(host_verifications, "\\[","")) %>%
mutate(host_verifications = str_replace(host_verifications, "\\]","")) %>%
mutate(host_verifications = str_replace_all(host_verifications, "\"","")) %>%
mutate(host_verifications = str_replace_all(host_verifications, ", " ,",")) %>%
mutate(host_verifications_list = as.list(strsplit(host_verifications, ","))) %>%
mutate(no_of_vf = lengths(host_verifications_list))
max_verf.sin <- listing_host_verf.sin %>%
select(host_verifications, no_of_vf) %>%
group_by() %>%
slice(which.max(no_of_vf))
host_verf_list_string <- as.list(strsplit(as.character(max_verf.sin["host_verifications"]), ","))
host_verf_list_string
## [[1]]
## [1] "'email'" "'phone'"
## [3] "'facebook'" "'google'"
## [5] "'reviews'" "'jumio'"
## [7] "'offline_government_id'" "'selfie'"
## [9] "'government_id'" "'identity_manual'"
## [11] "'work_email'"
Let’s use this list to generate dummy variables. > [‘email’, ‘phone’, ‘facebook’, ‘reviews’, ‘manual_offline’, ‘jumio’, ‘offline_government_id’, ‘government_id’, ‘work_email’]
Let’s now generalise both steps for all cities and create dummy variables for each.
wrangle_amenities_hostvf <- function (listing)
{
listing <- listing %>%
mutate(amenities = str_replace(amenities, "\\[","")) %>%
mutate(amenities = str_replace(amenities, "\\]","")) %>%
mutate(amenities = str_replace_all(amenities, "\"","")) %>%
mutate(amenities = str_replace_all(amenities, ", " ,",")) %>%
mutate(amenities_list = as.list(strsplit(amenities, ","))) %>%
mutate(no_of_am = lengths(amenities_list)) %>%
mutate(Amenities_Wifi = as.numeric(grepl('Wifi', amenities, fixed = TRUE))) %>%
mutate(Amenities_Shampoo = as.numeric(grepl('Shampoo', amenities, fixed = TRUE))) %>%
mutate(Amenities_Kitchen = as.numeric(grepl('Kitchen', amenities, fixed = TRUE))) %>%
mutate(Amenities_Long_Term = as.numeric(grepl('Long term stays', amenities, fixed = TRUE))) %>%
mutate(Amenities_Washer = as.numeric(grepl('Washer', amenities, fixed = TRUE))) %>%
mutate(Amenities_HairDryer = as.numeric(grepl('Hair dryer', amenities, fixed = TRUE))) %>%
mutate(Amenities_HotWater = as.numeric(grepl('Hot water', amenities, fixed = TRUE))) %>%
mutate(Amenities_TV = as.numeric(grepl('TV', amenities, fixed = TRUE))) %>%
mutate(Amenities_AC = as.numeric(grepl('Air conditioning', amenities, fixed = TRUE))) %>%
mutate(host_verifications = str_replace(host_verifications, "\\[","")) %>%
mutate(host_verifications = str_replace(host_verifications, "\\]","")) %>%
mutate(host_verifications = str_replace_all(host_verifications, "\"","")) %>%
mutate(host_verifications = str_replace_all(host_verifications, ", " ,",")) %>%
mutate(host_verifications_list = as.list(strsplit(host_verifications, ","))) %>%
mutate(hv_email = as.numeric(grepl('email', host_verifications, fixed = TRUE))) %>%
mutate(hv_phone = as.numeric(grepl('phone', host_verifications, fixed = TRUE))) %>%
mutate(hv_facebook = as.numeric(grepl('facebook', host_verifications, fixed = TRUE))) %>%
mutate(hv_reviews = as.numeric(grepl('reviews', host_verifications, fixed = TRUE))) %>%
mutate(hv_manual_offline = as.numeric(grepl('manual_offline', host_verifications, fixed = TRUE))) %>%
mutate(hv_manual_jumio = as.numeric(grepl('jumio', host_verifications, fixed = TRUE))) %>%
mutate(hv_manual_off_gov = as.numeric(grepl('offline_government_id', host_verifications, fixed = TRUE))) %>%
mutate(hv_manual_gov = as.numeric(grepl('government_id', host_verifications, fixed = TRUE))) %>%
mutate(hv_manual_work_email = as.numeric(grepl('work_email', host_verifications, fixed = TRUE))) %>%
mutate(no_of_vf = lengths(host_verifications_list))
}
listing.sin <- wrangle_amenities_hostvf(listing.sin)
listing.nrt <- wrangle_amenities_hostvf(listing.nrt)
listing.tpe <- wrangle_amenities_hostvf(listing.tpe)
listing.hkg <- wrangle_amenities_hostvf(listing.hkg)
This function wrangles the Airbnb data into an analysable dataset. Because we will be doing the same for multiple cities, we wrap the logic in a function. It builds on code shared in the lecture for Module 2; the main additions are the id column, the neighbourhoods and the dummy variables for amenities and host verifications.
wrangle_airbnb_dataset <- function (raw_listing_full)
{
listing.raw <- raw_listing_full %>%
select(id, price,number_of_reviews,beds,bathrooms,accommodates,reviews_per_month, property_type, room_type, review_scores_rating, neighbourhood_cleansed, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, latitude, longitude, amenities, last_review, no_of_am, Amenities_Wifi, Amenities_Shampoo, Amenities_Kitchen, Amenities_Long_Term, Amenities_Washer, Amenities_HairDryer, Amenities_HotWater, Amenities_TV,Amenities_AC, host_verifications, hv_email,hv_phone, hv_facebook, hv_reviews, hv_manual_offline, hv_manual_jumio,hv_manual_off_gov, hv_manual_gov, hv_manual_work_email, no_of_vf) %>%
rename(Reviews = number_of_reviews) %>%
rename(Beds = beds) %>%
rename(Baths = bathrooms) %>%
rename(Capacity = accommodates) %>%
rename(Monthly_Reviews = reviews_per_month) %>%
rename(Property_Type = property_type) %>%
rename(Room_Type = room_type) %>%
rename(Price = price) %>%
rename(Rating = review_scores_rating) %>%
# rename(Neighbourhood = neighbourhood_cleansed) %>%
rename(host_Superhost = host_is_superhost)
listing.raw <- listing.raw %>%
mutate(Price = str_replace(Price, "[$]", "")) %>%
mutate(Price = str_replace(Price, "[,]", "")) %>%
mutate(Price = as.numeric(Price)) %>%
# mutate(hood_factor = as.factor(Neighbourhood)) %>%
mutate(host_response_rate = str_replace(host_response_rate, "[%]", "")) %>%
mutate(host_response_rate = as.numeric(host_response_rate)/100) %>%
mutate(host_acceptance_rate = str_replace(host_acceptance_rate, "[%]", "")) %>%
mutate(host_acceptance_rate = as.numeric(host_acceptance_rate)/100) %>%
mutate(host_Superhost = ifelse(host_Superhost =="f", 0, 1)) %>%
mutate(host_response_time = factor(host_response_time, levels = c("within a few hours", "within a day", "a few days or more"))) %>%
mutate(host_response_hours = ifelse(host_response_time == "within a few hours", 1, 0)) %>%
mutate(host_response_day = ifelse(host_response_time == "within a day", 1, 0)) %>%
mutate(host_response_few_days = ifelse(host_response_time == "a few days or more", 1, 0)) %>%
mutate(last_review = as.Date(last_review)) %>%
mutate(Days_since_last_review = as.numeric(difftime(as.Date("2021-12-31"), last_review, units="days"))) %>%
mutate(Room_Type = factor(Room_Type, levels = c("Shared room", "Private room", "Entire home/apt"))) %>%
mutate(Capacity_Sqr = Capacity * Capacity) %>%
mutate(Beds_Sqr = Beds * Beds) %>%
mutate(Baths_Sqr = Baths * Baths) %>%
mutate(ln_Price = log(1+Price)) %>%
mutate(ln_Beds = log(1+Beds)) %>%
mutate(ln_Baths = log(1+Baths)) %>%
mutate(ln_Capacity = log(1+Capacity)) %>%
mutate(ln_Rating = log(1+Rating)) %>%
mutate(Shared_ind = ifelse(Room_Type == "Shared room",1,0)) %>%
mutate(House_ind = ifelse(Room_Type == "Entire home/apt",1,0)) %>%
mutate(Private_ind = ifelse(Room_Type == "Private room",1,0)) %>%
mutate(Capacity_x_Shared_ind = Shared_ind * Capacity) %>%
mutate(H_Cap = House_ind * Capacity) %>%
mutate(P_Cap = Private_ind * Capacity) %>%
mutate(ln_Capacity_x_Shared_ind = Shared_ind * ln_Capacity) %>%
mutate(ln_Capacity_x_House_ind = House_ind * ln_Capacity) %>%
mutate(ln_Capacity_x_Private_ind = Private_ind * ln_Capacity) %>%
filter(!is.na(Price))
return(listing.raw)
}
list.sin <- wrangle_airbnb_dataset(listing.sin)
list.nrt <- wrangle_airbnb_dataset(listing.nrt)
list.tpe <- wrangle_airbnb_dataset(listing.tpe)
list.hkg <- wrangle_airbnb_dataset(listing.hkg)
There’s value in knowing how many reviews a property has received in the last 12 months as a measure of how active it is. The notion is that modelling price for active listings will be more accurate than modelling price for all listings.
The approach taken in this paper was to look at listings active in the past 12 months. However, given pandemic-related restrictions, we felt it would be better to look at the period between 1 Jan 2019 and 31 Dec 2021, which adds one normal year to the two pandemic years, 2020 and 2021.
We check this by wrangling the review dataset.
count_reviews <- function(listings, reviews, from_date, to_date)
{
reviews_grouped <- reviews %>%
mutate(date = as.Date(date)) %>%
filter(between(date, as.Date(from_date), as.Date(to_date))) %>%
group_by(listing_id) %>%
summarise(reviews_since_2019 = n()) %>%
# Estimate bookings as 2x reviews, assuming roughly half of all bookings leave a review
mutate(bookings_since_2019 = reviews_since_2019 * 2) %>%
rename(id = listing_id)
listings <- left_join(listings, reviews_grouped, by = "id")
return(listings)
}
start_date = "2019-1-1"
end_date = "2021-12-31"
list.sin <-count_reviews(list.sin, reviews.sin,start_date, end_date)
list.hkg <- count_reviews(list.hkg, reviews.hkg,start_date, end_date)
list.tpe <- count_reviews(list.tpe, reviews.tpe,start_date, end_date)
list.nrt <- count_reviews(list.nrt, reviews.nrt,start_date, end_date)
list_after_2019.sin <- list.sin %>% filter(!is.na(reviews_since_2019))
list_after_2019.tpe <- list.tpe %>% filter(!is.na(reviews_since_2019))
list_after_2019.nrt <- list.nrt %>% filter(!is.na(reviews_since_2019))
list_after_2019.hkg <- list.hkg %>% filter(!is.na(reviews_since_2019))
Let’s look at the post-2019 listings in map form.
generate_choropleth_by_city(list_after_2019.sin, map.sin, "Singapore")
generate_choropleth_by_city(list_after_2019.nrt, map.nrt, "Tokyo")
generate_choropleth_by_city(list_after_2019.hkg, map.hkg, "Hong Kong")
generate_choropleth_by_city(list_after_2019.tpe, map.tpe, "Taipei")
bar_charts_by_neighbourhood(list_after_2019.sin, "Singapore", neighbourhoods.sin)
## neighbourhood_cleansed n
## 1 Marina East 0
## 2 Straits View 0
## 3 Changi 0
## 4 Changi Bay 0
## 5 Paya Lebar 0
## 6 North-Eastern Islands 0
## 7 Seletar 0
## 8 Simpang 0
## 9 Sungei Kadut 0
## 10 Boon Lay 0
## 11 Pioneer 0
## 12 Tengah 0
## 13 Tuas 0
## 14 Western Islands 0
## 15 Western Water Catchment 0
bar_charts_by_neighbourhood(list_after_2019.nrt, "Tokyo", neighbourhoods.nrt)
## neighbourhood_cleansed n
## 1 Aogashima Mura 0
## 2 Fussa Shi 0
## 3 Hachijo Machi 0
## 4 Higashiyamato Shi 0
## 5 Hinode Machi 0
## 6 Hinohara Mura 0
## 7 Inagi Shi 0
## 8 Kiyose Shi 0
## 9 Kozushima Mura 0
## 10 Mikurajima Mura 0
## 11 Miyake Mura 0
## 12 Mizuho Machi 0
## 13 Niijima Mura 0
## 14 Ogasawara Mura 0
## 15 Oshima Machi 0
## 16 Toshima Mura 0
bar_charts_by_neighbourhood(list_after_2019.hkg, "Hong Kong", neighbourhoods.hkg)
## [1] neighbourhood_cleansed n
## <0 rows> (or 0-length row.names)
bar_charts_by_neighbourhood(list_after_2019.tpe, "Taipei", neighbourhoods.tpe)
## [1] neighbourhood_cleansed n
## <0 rows> (or 0-length row.names)
add_earnings <- function(listing)
{
# Estimate earnings assuming an average stay of 3 nights per booking
return (listing %>% mutate(earnings_since_2019 = bookings_since_2019 * 3 * Price))
}
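To make the two assumptions concrete (the 2x review-to-booking ratio from count_reviews, and the average 3-night stay used for earnings), here is a worked example with purely illustrative figures:

```r
# Illustrative figures only, not taken from the dataset
reviews_in_window <- 24    # reviews between 2019-01-01 and 2021-12-31
nightly_price <- 150

estimated_bookings <- reviews_in_window * 2                   # ~half of bookings leave a review
estimated_earnings <- estimated_bookings * 3 * nightly_price  # 3-night average stay

estimated_earnings
# 21600
```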
list_after_2019.sin <- add_earnings(list_after_2019.sin)
list_after_2019.tpe <- add_earnings(list_after_2019.tpe)
list_after_2019.nrt <- add_earnings(list_after_2019.nrt)
list_after_2019.hkg <- add_earnings(list_after_2019.hkg)
Let’s group neighbourhoods into five popularity tiers: extremely popular, popular, moderate, not so popular, and sparse.
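The binning relies on dplyr’s ntile, which ranks the listing counts and splits them into (near-)equal-sized buckets, so the highest bucket number holds the busiest neighbourhoods. A quick illustration with made-up counts:

```r
library(dplyr)

# Ten neighbourhood listing counts, already in descending order
counts <- c(224, 221, 136, 101, 67, 52, 22, 16, 9, 1)
ntile(counts, 5)
# 5 5 4 4 3 3 2 2 1 1
```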
district_bins.sin <- bin_districts(list_after_2019.sin, bins=5)
district_bins.sin
## neighbourhood_cleansed n nb_group
## 1 Geylang 224 5
## 2 Kallang 221 5
## 3 Outram 201 5
## 4 Rochor 136 5
## 5 Downtown Core 134 5
## 6 Bedok 101 5
## 7 Bukit Merah 94 5
## 8 Novena 73 5
## 9 River Valley 67 4
## 10 Queenstown 52 4
## 11 Tanglin 38 4
## 12 Singapore River 36 4
## 13 Jurong West 32 4
## 14 Marine Parade 27 4
## 15 Jurong East 24 4
## 16 Orchard 24 4
## 17 Newton 22 3
## 18 Woodlands 22 3
## 19 Bukit Timah 21 3
## 20 Serangoon 21 3
## 21 Clementi 17 3
## 22 Bishan 16 3
## 23 Tampines 16 3
## 24 Hougang 15 3
## 25 Toa Payoh 13 2
## 26 Museum 9 2
## 27 Punggol 9 2
## 28 Ang Mo Kio 8 2
## 29 Yishun 8 2
## 30 Choa Chu Kang 7 2
## 31 Central Water Catchment 6 2
## 32 Pasir Ris 6 2
## 33 Bukit Batok 6 1
## 34 Bukit Panjang 6 1
## 35 Sengkang 5 1
## 36 Southern Islands 3 1
## 37 Marina South 2 1
## 38 Lim Chu Kang 1 1
## 39 Mandai 1 1
## 40 Sembawang 1 1
district_bins.nrt <- bin_districts(list_after_2019.nrt, bins=5)
district_bins.nrt
## neighbourhood_cleansed n nb_group
## 1 Shinjuku Ku 1653 5
## 2 Taito Ku 1147 5
## 3 Sumida Ku 810 5
## 4 Toshima Ku 707 5
## 5 Shibuya Ku 509 5
## 6 Ota Ku 357 5
## 7 Minato Ku 342 5
## 8 Chuo Ku 330 5
## 9 Nakano Ku 255 5
## 10 Setagaya Ku 251 4
## 11 Katsushika Ku 216 4
## 12 Kita Ku 181 4
## 13 Suginami Ku 177 4
## 14 Arakawa Ku 153 4
## 15 Shinagawa Ku 137 4
## 16 Koto Ku 136 4
## 17 Edogawa Ku 135 4
## 18 Itabashi Ku 110 4
## 19 Chiyoda Ku 110 3
## 20 Bunkyo Ku 106 3
## 21 Adachi Ku 73 3
## 22 Meguro Ku 47 3
## 23 Nerima Ku 44 3
## 24 Hachioji Shi 18 3
## 25 Hino Shi 15 3
## 26 Machida Shi 14 3
## 27 Chofu Shi 12 3
## 28 Fuchu Shi 11 2
## 29 Kokubunji Shi 10 2
## 30 Mitaka Shi 9 2
## 31 Akiruno Shi 7 2
## 32 Higashimurayama Shi 7 2
## 33 Kunitachi Shi 7 2
## 34 Musashino Shi 7 2
## 35 Tachikawa Shi 7 2
## 36 Tama Shi 7 2
## 37 Nishitokyo Shi 6 1
## 38 Kodaira Shi 5 1
## 39 Ome Shi 5 1
## 40 Komae Shi 4 1
## 41 Hamura Shi 3 1
## 42 Musashimurayama Shi 3 1
## 43 Okutama Machi 3 1
## 44 Akishima Shi 2 1
## 45 Higashikurume Shi 2 1
## 46 Koganei Shi 2 1
district_bins.hkg <- bin_districts(list_after_2019.hkg, bins=5)
district_bins.hkg
## neighbourhood_cleansed n nb_group
## 1 Yau Tsim Mong 1208 5
## 2 Wan Chai 311 5
## 3 Central & Western 279 5
## 4 Islands 211 4
## 5 Kowloon City 96 4
## 6 Eastern 70 4
## 7 Yuen Long 66 3
## 8 North 65 3
## 9 Sai Kung 47 3
## 10 Sham Shui Po 34 3
## 11 Sha Tin 24 2
## 12 Southern 24 2
## 13 Tai Po 19 2
## 14 Tuen Mun 12 2
## 15 Kwun Tong 9 1
## 16 Tsuen Wan 4 1
## 17 Kwai Tsing 3 1
## 18 Wong Tai Sin 3 1
district_bins.tpe <- bin_districts(list_after_2019.tpe, bins=5)
district_bins.tpe
## neighbourhood_cleansed n nb_group
## 1 萬華區 536 5
## 2 中正區 475 5
## 3 大安區 454 4
## 4 中山區 407 4
## 5 信義區 283 3
## 6 大同區 153 3
## 7 松山區 142 2
## 8 士林區 119 2
## 9 文山區 64 2
## 10 北投區 54 1
## 11 內湖區 49 1
## 12 南港區 24 1
list_after_2019.sin <- left_join(list_after_2019.sin, district_bins.sin %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.nrt <- left_join(list_after_2019.nrt, district_bins.nrt %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.hkg <- left_join(list_after_2019.hkg, district_bins.hkg %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.tpe <- left_join(list_after_2019.tpe, district_bins.tpe %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
# list_after_2019.sin
list_after_2019.sin <- dummy_cols(list_after_2019.sin, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.nrt <- dummy_cols(list_after_2019.nrt, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.tpe <- dummy_cols(list_after_2019.tpe, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.hkg <- dummy_cols(list_after_2019.hkg, select_columns = "nb_group", remove_selected_columns = TRUE)
# list_after_2019.sin
# list_after_2019.nrt
# list_after_2019.hkg
# list_after_2019.tpe
This reduces the number of listings and, hopefully, removes quite a few outliers.
cities <- c("Singapore", "Tokyo", "Taipei", "Hong Kong")
no_of_listings <- c(nrow(listing.sin), nrow(listing.nrt), nrow(listing.tpe), nrow(listing.hkg))
no_of_listings_after_2019 <- c(nrow(list_after_2019.sin), nrow(list_after_2019.nrt), nrow(list_after_2019.tpe), nrow(list_after_2019.hkg))
data <- data.frame(cities, no_of_listings, no_of_listings_after_2019)
no_of_listings.fig <- plot_ly(data,
x = cities,
y = ~no_of_listings,
type = "bar",
text = no_of_listings,
name = "No of Listings (All Years)"
)
no_of_listings.fig <- no_of_listings.fig %>% add_trace(y = ~no_of_listings_after_2019, text= no_of_listings_after_2019, name = "No of Active Listings")
no_of_listings.fig <- no_of_listings.fig %>% layout(title ="No of Listings Per City", yaxis = list(title="No of Listings"))
no_of_listings.fig
This roughly halves the number of listings being considered in Hong Kong and Singapore, but not in Taipei and Tokyo.
We now examine the variables for each city, starting with Singapore.
data_exploration <- function (listing)
{
plot_str(listing, type="r")
introduce(listing)
plot_intro(listing)
plot_missing(listing)
plot_bar(listing)
pca_df <- na.omit(listing[, c("Price", "Room_Type", "Reviews", "Beds", "Capacity", "Monthly_Reviews", "host_Superhost", "Rating")])
plot_qq(pca_df)
plot_prcomp(pca_df, variance_cap = 0.9, nrow = 2L, ncol=2L)
}
data_exploration(list.sin)
## 4 columns ignored with more than 50 categories.
## Property_Type: 51 categories
## amenities: 2815 categories
## last_review: 962 categories
## host_verifications: 140 categories
Taipei
data_exploration(list.tpe)
## 4 columns ignored with more than 50 categories.
## Property_Type: 58 categories
## amenities: 3376 categories
## last_review: 1050 categories
## host_verifications: 158 categories
Hong Kong
data_exploration(list.hkg)
## 4 columns ignored with more than 50 categories.
## Property_Type: 69 categories
## amenities: 3846 categories
## last_review: 1125 categories
## host_verifications: 150 categories
Tokyo
data_exploration(list.nrt)
## 4 columns ignored with more than 50 categories.
## Property_Type: 64 categories
## amenities: 7467 categories
## last_review: 987 categories
## host_verifications: 192 categories
We will now examine outliers in our data for various parameters, filtering for listings that have seen at least one booking since 1 Jan 2019, starting with the Singapore data.
generate_price_boxplot <- function (listing.clean, city, comparison_col = "")
{
# png(file = "./graphs/boxplot.png")
if (comparison_col == "")
{
boxplot(listing.clean$Price, data = listing.clean, ylab="Price", main=paste("Boxplot: Price for", city))
}
else
boxplot(listing.clean$Price ~ listing.clean[[comparison_col]], data = listing.clean, ylab="Price", xlab=comparison_col, main=paste("Boxplot: Price vs", comparison_col, "for", city))
# dev.off()
}
generate_price_boxplot(list_after_2019.sin, "Singapore")
generate_price_boxplot(list_after_2019.sin, "Singapore", "Room_Type")
generate_price_boxplot(list_after_2019.sin, "Singapore", "Property_Type")
generate_price_boxplot(list_after_2019.sin, "Singapore", "Capacity")
generate_price_boxplot(list_after_2019.sin, "Singapore", "Beds")
generate_price_boxplot(list_after_2019.sin, "Singapore", "neighbourhood_cleansed")
generate_price_boxplot(list_after_2019.sin, "Singapore", "Reviews")
It seems a single boat in the Bukit Merah area (possibly moored next to the marina at Keppel Bay) has a very high price, at $2,500/night. Let’s look at that one more closely.
# head(list_after_2019.sin %>% arrange(desc(Price)))
# head(list_after_2019.sin %>% arrange(desc(reviews_since_2019)))
head(list_after_2019.sin %>% filter(Property_Type == "Boat") %>% arrange (desc(reviews_since_2019)))
## id Price Reviews Beds Baths Capacity Monthly_Reviews Property_Type
## 1 31527262 344 217 1 NA 2 6.24 Boat
## 2 37907711 199 177 3 NA 4 6.23 Boat
## 3 20247516 2500 55 4 NA 5 1.05 Boat
## 4 50433019 288 7 2 NA 5 2.50 Boat
## Room_Type Rating neighbourhood_cleansed host_response_time
## 1 Entire home/apt 4.94 Southern Islands within an hour
## 2 Entire home/apt 4.49 Punggol within an hour
## 3 Entire home/apt 4.74 Bukit Merah within a day
## 4 Entire home/apt 5.00 Punggol within a few hours
## host_response_rate host_acceptance_rate host_Superhost latitude longitude
## 1 <NA> 1.00 1 1.24535 103.8387
## 2 <NA> 0.96 0 1.41585 103.9001
## 3 <NA> 0.98 0 1.26520 103.8190
## 4 <NA> 0.92 0 1.41480 103.8986
## amenities
## 1 Toaster,Sound system,Hangers,Bed linens,Hot water kettle,Coffee maker,Carbon monoxide alarm,Hair dryer,TV,Outdoor furniture,Dining table,Security cameras on property,Outdoor dining area,Private entrance,Refrigerator,Microwave,Waterfront,Wifi,Smoke alarm,Shampoo,Portable fans,Paid parking off premises,Hot water,Mini fridge,Essentials,First aid kit,Dishes and silverware,Air conditioning,Shower gel,Fire extinguisher
## 2 Sound system,Hangers,Bed linens,Coffee maker,Hair dryer,TV,Paid parking on premises,Lockbox,Room-darkening shades,Security cameras on property,Long term stays allowed,Patio or balcony,Pour-over coffee,Microwave,Waterfront,Wifi,Smoke alarm,Shampoo,Extra pillows and blankets,Hot water,Mini fridge,Essentials,Kitchen,EV charger,First aid kit,Air conditioning,Shower gel,Fire extinguisher
## 3 Shampoo,Essentials,Long term stays allowed,Carbon monoxide alarm,Hair dryer,Host greets you,Hot water,Pool,TV,Paid parking on premises,Air conditioning,Smoke alarm,Fire extinguisher
## 4 Toaster,Bidet,Sound system,Rice maker,Hangers,Bed linens,Hot water kettle,Cooking basics,Freezer,Washer,Bathtub,Hair dryer,TV,Outdoor furniture,Dining table,Dedicated workspace,Free parking on premises,Lockbox,Cleaning products,Clothing storage: closet,dresser,and wardrobe,Long term stays allowed,Outdoor dining area,Private entrance,Pocket wifi,Refrigerator,Waterfront,Wifi,Smoke alarm,Induction stove,Dishwasher,Extra pillows and blankets,Portable fans,Hot water,Mini fridge,Iron,Essentials,Boat slip,Kitchen,EV charger,First aid kit,Air conditioning,Dishes and silverware,Fire extinguisher
## last_review no_of_am Amenities_Wifi Amenities_Shampoo Amenities_Kitchen
## 1 2021-09-15 30 1 1 0
## 2 2021-12-12 28 1 1 1
## 3 2020-07-28 13 0 1 0
## 4 2021-12-07 45 1 0 1
## Amenities_Long_Term Amenities_Washer Amenities_HairDryer Amenities_HotWater
## 1 0 0 1 1
## 2 1 0 1 1
## 3 1 0 1 1
## 4 1 1 1 1
## Amenities_TV Amenities_AC
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## host_verifications
## 1 'email','phone','jumio','offline_government_id','selfie','government_id','identity_manual'
## 2 'email','phone','jumio','offline_government_id','selfie','government_id','identity_manual'
## 3 'email','phone','reviews','jumio','selfie','government_id','identity_manual'
## 4 'phone'
## hv_email hv_phone hv_facebook hv_reviews hv_manual_offline hv_manual_jumio
## 1 1 1 0 0 0 1
## 2 1 1 0 0 0 1
## 3 1 1 0 1 0 1
## 4 0 1 0 0 0 0
## hv_manual_off_gov hv_manual_gov hv_manual_work_email no_of_vf
## 1 1 1 0 7
## 2 1 1 0 7
## 3 0 1 0 7
## 4 0 0 0 1
## host_response_hours 1 0 host_response_day host_response_few_days
## 1 NA 1 0 NA NA
## 2 NA 1 0 NA NA
## 3 NA 1 0 NA NA
## 4 NA 1 0 NA NA
## Days_since_last_review Capacity_Sqr Beds_Sqr Baths_Sqr ln_Price ln_Beds
## 1 107 4 1 NA 5.843544 0.6931472
## 2 19 16 9 NA 5.298317 1.3862944
## 3 521 25 16 NA 7.824446 1.6094379
## 4 24 25 4 NA 5.666427 1.0986123
## ln_Baths ln_Capacity ln_Rating Shared_ind House_ind Private_ind
## 1 NA 1.098612 1.781709 0 1 0
## 2 NA 1.609438 1.702928 0 1 0
## 3 NA 1.791759 1.747459 0 1 0
## 4 NA 1.791759 1.791759 0 1 0
## Capacity_x_Shared_ind H_Cap P_Cap ln_Capacity_x_Shared_ind
## 1 0 2 0 0
## 2 0 4 0 0
## 3 0 5 0 0
## 4 0 5 0 0
## ln_Capacity_x_House_ind ln_Capacity_x_Private_ind reviews_since_2019
## 1 1.098612 0 217
## 2 1.609438 0 177
## 3 1.791759 0 26
## 4 1.791759 0 7
## bookings_since_2019 earnings_since_2019 nb_group_1 nb_group_2 nb_group_3
## 1 434 447888 1 0 0
## 2 354 211338 0 1 0
## 3 52 390000 0 0 0
## 4 14 12096 0 1 0
## nb_group_4 nb_group_5
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
head(list_after_2019.sin %>% group_by(id, Property_Type, bookings_since_2019) %>% summarise(percent_of_total = bookings_since_2019 * 100 / sum(list_after_2019.sin$bookings_since_2019)) %>% filter(Property_Type == "Boat") %>% arrange(desc(bookings_since_2019)))
## `summarise()` has grouped output by 'id', 'Property_Type'. You can override
## using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: id, Property_Type [4]
## id Property_Type bookings_since_2019 percent_of_total
## <int> <chr> <dbl> <dbl>
## 1 31527262 Boat 434 0.967
## 2 37907711 Boat 354 0.788
## 3 20247516 Boat 52 0.116
## 4 50433019 Boat 14 0.0312
There are four boats listed on Airbnb in Singapore. Together they account for roughly 2% of all bookings since 2019.
list_of_vars <- c("earnings_since_2019", "Rating", "Reviews", "Beds", "Capacity", "host_acceptance_rate", "host_Superhost", "Amenities_Wifi", "Amenities_Shampoo", "Amenities_Kitchen", "Amenities_Long_Term", "Amenities_Washer", "Amenities_HairDryer", "Amenities_HotWater", "Amenities_TV", "Amenities_AC", "hv_email", "hv_reviews", "Shared_ind", "House_ind", "Private_ind")
vars_list.sin = list_after_2019.sin %>% select(all_of(c(list_of_vars, "Price"))) %>% na.omit()
vars_list.tpe = list_after_2019.tpe %>% select(all_of(c(list_of_vars, "Price"))) %>% na.omit()
vars_list.hkg = list_after_2019.hkg %>% select(all_of(c(list_of_vars, "Price"))) %>% na.omit()
vars_list.nrt = list_after_2019.nrt %>% select(all_of(c(list_of_vars, "Price"))) %>% na.omit()
# vars_list.sin
paint_correlations <- function(listing) {
  # chart.Correlation(listing, histogram = TRUE, pch = 19)  # alternative view
  corrplot::corrplot(cor(listing, use = "complete.obs"), method = "square", type = "lower")
}
paint_correlations(vars_list.sin)
paint_correlations(vars_list.nrt)
paint_correlations(vars_list.tpe)
paint_correlations(vars_list.hkg)
Principal Components Regression (PCR) finds M linear combinations ("principal components") of our predictors (list_of_vars) and then uses least squares to fit a linear regression model on those components.
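As a minimal base-R illustration of that idea (run on the built-in mtcars data, not the project data): take the principal components of the scaled predictors, then regress the response on the first M of them.

```r
# PCR "by hand": PCA on the scaled predictors, then ordinary least
# squares on the first M component scores.
X <- scale(mtcars[, c("cyl", "disp", "hp", "wt")])
pca <- prcomp(X)
M <- 2                                 # number of components retained
scores <- as.data.frame(pca$x[, 1:M])  # the M "principal components"
scores$mpg <- mtcars$mpg
pcr_by_hand <- lm(mpg ~ ., data = scores)
summary(pcr_by_hand)$r.squared
```

The pls::pcr() call below does the same thing in one step, and additionally cross-validates over the number of components.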
set.seed(1)
pcr_model <- function(listings, city) {
  pcr_model <- pcr(Price ~ reviews_since_2019 + Rating + host_acceptance_rate + host_Superhost + Shared_ind + House_ind + Private_ind + Amenities_Wifi + hv_email, data = listings, scale = TRUE, validation = "CV")
  summary(pcr_model)
  plot(pcr_model)
  validationplot(pcr_model, val.type = "MSEP")
  validationplot(pcr_model, val.type = "R2")
  # NB: rows with NAs are dropped during fitting, and predict() without
  # ncomp returns predictions for every component count at once, so the
  # lengths here do not match listings$Price (hence the recycling warning).
  print(paste("MAE for", city, ":", mae(listings$Price, predict(pcr_model))))
  return(pcr_model)
}
pcr_model.sin <-pcr_model(list_after_2019.sin, "Singapore")
## Data: X dimension: 1405 9
## Y dimension: 1405 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 152.6 142.9 142.1 141.6 142.4 142.2 142.3
## adjCV 152.6 142.9 142.0 141.6 142.3 142.1 142.2
## 7 comps 8 comps 9 comps
## CV 140.8 140.7 140.6
## adjCV 140.7 140.6 140.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 22.14 40.73 53.25 64.19 74.57 84.27 92.71 100.00
## Price 12.89 13.94 14.45 14.45 15.23 15.23 16.90 17.21
## 9 comps
## X 100.00
## Price 17.21
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Singapore : 108.772298292697"
pcr_model.hkg <-pcr_model(list_after_2019.hkg, "Hong Kong")
## Data: X dimension: 1945 9
## Y dimension: 1945 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 2455 2455 2455 2454 2455 2456 2457
## adjCV 2455 2455 2455 2454 2455 2456 2456
## 7 comps 8 comps 9 comps
## CV 2458 2455 2455
## adjCV 2457 2454 2455
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 21.7730 37.2315 50.3304 62.1254 72.9752 83.0920 91.9797 100.0000
## Price 0.2567 0.2572 0.4119 0.4395 0.4482 0.5174 0.5345 0.8255
## 9 comps
## X 100.0000
## Price 0.8256
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Hong Kong : 869.547093791851"
pcr_model.tpe <-pcr_model(list_after_2019.tpe, "Taipei")
## Data: X dimension: 2175 9
## Y dimension: 2175 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4229 4170 4170 4153 4154 4156 4148
## adjCV 4229 4170 4169 4153 4153 4156 4148
## 7 comps 8 comps 9 comps
## CV 4134 4131 4132
## adjCV 4133 4130 4131
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.408 39.493 52.740 64.281 74.915 84.836 93.121 100.000
## Price 2.953 3.029 3.792 3.795 3.795 4.154 4.932 5.071
## 9 comps
## X 100.000
## Price 5.078
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Taipei : 2209.48295156008"
pcr_model.nrt <-pcr_model(list_after_2019.nrt, "Tokyo")
## Data: X dimension: 7249 9
## Y dimension: 7249 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 29610 29516 29490 29496 29487 29489 29492
## adjCV 29610 29516 29489 29495 29486 29488 29491
## 7 comps 8 comps 9 comps
## CV 29492 29494 29495
## adjCV 29490 29492 29493
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 22.2622 37.4091 49.4255 61.1240 71.8952 82.0212 91.8955 100.0000
## Price 0.6487 0.8555 0.8623 0.9325 0.9325 0.9327 0.9626 0.9718
## 9 comps
## X 100.0000
## Price 0.9733
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Tokyo : 12041.061870605"
The mean absolute error differs by city: roughly 109 for Singapore, 870 for Hong Kong, 2,209 for Taipei, and 12,041 for Tokyo. Each is presumably in the local listing currency, so the figures are not directly comparable across cities.
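The recycling warnings above arise because rows with NAs are dropped during fitting while predict() without ncomp returns predictions for every component count at once. A sketch (assuming the fitted pcr_model.sin object from above) of the MAE at a fixed 9 components, computed only on the rows actually used in the fit:

```r
# Observed prices for the rows kept by the fit, and the fitted values
# from the 9-component model (fitted.values is [obs, response, ncomp]).
obs  <- model.response(model.frame(pcr_model.sin))
pred <- pcr_model.sin$fitted.values[, 1, 9]
mean(abs(obs - pred))  # MAE at 9 components
```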
set.seed(1)
pcr_model_earnings <- function(listings, city) {
  pcr_model <- pcr(earnings_since_2019 ~ Price + reviews_since_2019 + Rating + host_acceptance_rate + host_Superhost + Shared_ind + House_ind + Private_ind + Amenities_Wifi + hv_email, data = listings, scale = TRUE, validation = "CV")
  summary(pcr_model)
  plot(pcr_model)
  validationplot(pcr_model, val.type = "MSEP")
  validationplot(pcr_model, val.type = "R2")
  print(paste("MAE for", city, ":", mae(listings$earnings_since_2019, predict(pcr_model))))
  return(pcr_model)
}
pcr_model.sin <-pcr_model_earnings(list_after_2019.sin, "Singapore")
## Data: X dimension: 1405 10
## Y dimension: 1405 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 31428 29557 28372 28288 25678 25378 22925
## adjCV 31428 29554 28366 28286 24823 25441 23114
## 7 comps 8 comps 9 comps 10 comps
## CV 17510 17465 16439 16436
## adjCV 17453 17414 16395 16391
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 22.14 38.99 50.4 60.24 69.93 78.66
## earnings_since_2019 11.70 18.99 19.6 37.12 41.06 52.10
## 7 comps 8 comps 9 comps 10 comps
## X 87.03 93.82 100.00 100.00
## earnings_since_2019 71.88 72.12 75.15 75.15
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Singapore : 20085.8644280392"
pcr_model.hkg <-pcr_model_earnings(list_after_2019.hkg, "Hong Kong")
## Data: X dimension: 1945 10
## Y dimension: 1945 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 410901 402205 390090 377536 381124 376353 341603
## adjCV 410901 402288 390081 377417 381617 367444 338156
## 7 comps 8 comps 9 comps 10 comps
## CV 346454 318750 295779 295967
## adjCV 343001 315592 292494 292664
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 19.648 33.56 45.45 56.10 66.09 75.83
## earnings_since_2019 3.538 10.11 15.39 15.39 48.30 49.76
## 7 comps 8 comps 9 comps 10 comps
## X 84.87 92.86 100.00 100.00
## earnings_since_2019 49.91 58.56 65.66 65.66
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Hong Kong : 200007.767176238"
pcr_model.tpe <-pcr_model_earnings(list_after_2019.tpe, "Taipei")
## Data: X dimension: 2175 10
## Y dimension: 2175 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1181326 1104149 1103314 1071703 1072319 990895 976066
## adjCV 1181326 1104158 1103341 1071364 1073671 959962 972962
## 7 comps 8 comps 9 comps 10 comps
## CV 967203 780310 753804 752060
## adjCV 964362 776832 750544 748890
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 22.49 36.10 48.35 58.74 68.46 78.03
## earnings_since_2019 12.47 12.77 18.39 18.73 38.85 38.90
## 7 comps 8 comps 9 comps 10 comps
## X 86.66 93.84 100.00 100.00
## earnings_since_2019 42.44 62.81 65.52 65.52
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Taipei : 621874.28842328"
pcr_model.nrt <-pcr_model_earnings(list_after_2019.nrt, "Tokyo")
## Data: X dimension: 7249 10
## Y dimension: 7249 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7764047 7613827 7340841 7216238 7147811 5899329 5450582
## adjCV 7764047 7613976 7341342 7222899 7149043 5652462 5440735
## 7 comps 8 comps 9 comps 10 comps
## CV 4892522 4785782 4762564 4762764
## adjCV 4863554 4777938 4754861 4755034
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 20.165 33.87 44.69 55.31 65.04 74.74
## earnings_since_2019 3.815 10.69 14.11 15.84 50.72 53.89
## 7 comps 8 comps 9 comps 10 comps
## X 83.85 92.71 100.00 100.00
## earnings_since_2019 63.09 64.32 64.64 64.64
## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length
## [1] "MAE for Tokyo : 3905051.77520906"
# lm0 <- lm(Price ~ Capacity, data = sin_listing.clean)
# summary(lm0)
# stargazer(lm0, type = "text")
#
# ggplot(data = sin_listing.clean, aes(x = Capacity, y = Price)) + geom_point(aes(size=3)) +
# scale_colour_hue(l=50) + # Use a slightly darker palette than normal
# geom_smooth(method=lm, # Add linear regression lines
# se=TRUE, # add shaded confidence region
# fullrange=TRUE) +
# theme(axis.text.x = element_text(size=15), axis.text.y = element_text(size=15),
# axis.title=element_text(size=15,face="bold"))
# # vif(lm0)
#
# # The moderating effect of type of room. Lets model that.
#
# lm1 <- lm(Price ~ Private_ind + House_ind, data = sin_listing.clean)
# summary(lm1)
# stargazer(lm1, type = "text")
# #Regression with Capacity and Dummy Variables for type of room:
#
# lm2 <- lm(Price ~ Capacity + Private_ind + House_ind, data = sin_listing.clean)
# summary(lm2)
# stargazer(lm2, type = "text")
#Regression with Capacity,Dummy Variables and interaction between the two:
# lm3 <- lm(Price ~ Capacity+Private_ind + House_ind+P_Cap+H_Cap, data = sin_listing.clean)
# summary(lm3)
# stargazer(lm3,type = "text")
Comments from Kevin: (1) I have arbitrarily selected most of the variables, removing those that do not make sense (e.g. id, latitude) and those that are completely NA (e.g. Baths). (2) Instead of imputing, I have removed rows with NAs; in the case of the SG data, the number of entries drops from 1,725 to 1,373.
### For checking number of missing data
# md.pattern(list_after_2019.country)
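Before dropping rows, a quick base-R way to see which columns carry the missing values (assumes the list_after_2019.sin data frame from above):

```r
# Count NAs per column and keep only the columns that have any,
# sorted so the worst offenders come first.
na_counts <- colSums(is.na(list_after_2019.sin))
sort(na_counts[na_counts > 0], decreasing = TRUE)
```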
clean_subset <- function(list_after_2019.country) {
### Arbitrary selection of a list of variables
### (index the function argument, not list_after_2019.sin, so the helper works for every city)
selecting_columns <- list_after_2019.country[,c("Price","Reviews","Beds","Capacity","Monthly_Reviews","Property_Type","Room_Type","Rating","neighbourhood_cleansed","host_response_time","host_acceptance_rate","host_Superhost","no_of_am","Amenities_Wifi","Amenities_Shampoo","Amenities_Kitchen","Amenities_Long_Term","Amenities_Washer","Amenities_HairDryer","Amenities_HotWater","Amenities_TV","Amenities_AC","hv_email","hv_phone","hv_facebook","hv_reviews","hv_manual_offline","hv_manual_jumio","hv_manual_off_gov","hv_manual_gov","hv_manual_work_email","no_of_vf","Days_since_last_review","Capacity_Sqr","Beds_Sqr","ln_Price","ln_Beds","ln_Capacity","ln_Rating","Shared_ind","House_ind","Private_ind","Capacity_x_Shared_ind","H_Cap","P_Cap","ln_Capacity_x_Shared_ind","ln_Capacity_x_House_ind","ln_Capacity_x_Private_ind","reviews_since_2019","bookings_since_2019", "earnings_since_2019" )]
### Removing rows with blanks instead of imputing
selecting_columns <- na.omit(selecting_columns)
selecting_columns$Property_Type <- as.factor(selecting_columns$Property_Type)
selecting_columns$neighbourhood_cleansed <- as.factor(selecting_columns$neighbourhood_cleansed)
selecting_columns$host_response_time <- as.factor(selecting_columns$host_response_time)
return(selecting_columns)
}
list_after_2019.sin_clean <- clean_subset(list_after_2019.sin)
Comments from Kevin: Can someone help double-check whether the R-squared value of 0.80 looks too high for a real-world dataset?
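One likely contributor worth checking: ln_Price, the log of the response, sits among the predictors in list_after_2019.sin_clean, which mechanically inflates R-squared. A quick sanity check (a sketch, assuming the cleaned data frame from above) is to refit without it and compare:

```r
# Full model minus the log of the response; a large drop in R-squared
# would confirm that ln_Price is doing most of the work.
fullmod_no_lnprice <- lm(Price ~ . - ln_Price, data = list_after_2019.sin_clean)
summary(fullmod_no_lnprice)$r.squared
```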
#Define Smallest and Full Model
minmod = lm(Price~1, data = list_after_2019.sin_clean)
fullmod = lm(Price~. , data = list_after_2019.sin_clean)
# Using BIC: k = log(nobs(fullmod)); using AIC: k = 2
backward_regression_model <- step(fullmod, scope = list(lower = minmod, upper = fullmod),direction = "backward", k=log(nobs(fullmod)), trace=F)
summary(backward_regression_model)
##
## Call:
## lm(formula = Price ~ Reviews + no_of_am + Amenities_Wifi + Amenities_Shampoo +
## Amenities_Kitchen + Amenities_TV + no_of_vf + ln_Price +
## ln_Capacity + H_Cap + ln_Capacity_x_Shared_ind + ln_Capacity_x_House_ind +
## reviews_since_2019 + earnings_since_2019, data = list_after_2019.sin_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -385.43 -30.80 -10.79 19.62 1067.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.257e+02 2.382e+01 -26.269 < 2e-16 ***
## Reviews 4.980e-01 7.689e-02 6.477 1.31e-10 ***
## no_of_am 1.692e+00 2.281e-01 7.418 2.09e-13 ***
## Amenities_Wifi -6.285e+01 1.622e+01 -3.874 0.000112 ***
## Amenities_Shampoo -2.378e+01 4.258e+00 -5.586 2.81e-08 ***
## Amenities_Kitchen -2.062e+01 5.487e+00 -3.758 0.000178 ***
## Amenities_TV -1.695e+01 5.063e+00 -3.347 0.000839 ***
## no_of_vf 3.267e+00 1.019e+00 3.206 0.001377 **
## ln_Price 1.839e+02 4.233e+00 43.442 < 2e-16 ***
## ln_Capacity -3.195e+01 7.289e+00 -4.384 1.25e-05 ***
## H_Cap 1.722e+01 3.167e+00 5.437 6.41e-08 ***
## ln_Capacity_x_Shared_ind 6.649e+01 7.385e+00 9.003 < 2e-16 ***
## ln_Capacity_x_House_ind -5.521e+01 8.661e+00 -6.375 2.50e-10 ***
## reviews_since_2019 -2.630e+00 1.827e-01 -14.392 < 2e-16 ***
## earnings_since_2019 1.863e-03 1.094e-04 17.027 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.8 on 1358 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8096
## F-statistic: 417.6 on 14 and 1358 DF, p-value: < 2.2e-16
# Putting forward stepwise regression in comments in case we need to use
# forward_regression = step(minmod, scope = list(lower = minmod, upper = fullmod),direction = "forward", k=log(nobs(fullmod)), trace=F)
# summary(forward_regression)
set.seed(123)
# Splitting the data into 80:20
train_idx <- sample(1:nrow(list_after_2019.sin_clean), 0.8*nrow(list_after_2019.sin_clean))
training_sin <- list_after_2019.sin_clean[train_idx,]
testing_sin <- list_after_2019.sin_clean[-train_idx,]
final_model_sin <- lm(formula = Price ~ ln_Price + Property_Type + ln_Capacity_x_Private_ind +
Capacity + Amenities_Wifi + Amenities_Shampoo + Amenities_AC +
no_of_am + Amenities_Kitchen + hv_email, data = training_sin)
summary(final_model_sin)
##
## Call:
## lm(formula = Price ~ ln_Price + Property_Type + ln_Capacity_x_Private_ind +
## Capacity + Amenities_Wifi + Amenities_Shampoo + Amenities_AC +
## no_of_am + Amenities_Kitchen + hv_email, data = training_sin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -410.08 -32.78 -7.09 24.00 1196.07
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -424.8837 51.2040 -8.298
## ln_Price 219.6829 5.0341 43.639
## Property_TypeCampsite -595.0186 60.3870 -9.853
## Property_TypeEntire condominium (condo) -372.3346 37.4185 -9.951
## Property_TypeEntire guest suite -338.4095 63.2327 -5.352
## Property_TypeEntire guesthouse -357.4847 81.3368 -4.395
## Property_TypeEntire loft -400.6474 81.1304 -4.938
## Property_TypeEntire place -382.1517 56.0709 -6.816
## Property_TypeEntire rental unit -362.0118 37.5349 -9.645
## Property_TypeEntire residential home -382.4777 41.7566 -9.160
## Property_TypeEntire serviced apartment -345.4811 37.4602 -9.223
## Property_TypeEntire townhouse -411.3871 63.1003 -6.520
## Property_TypePrivate room -258.9714 53.9745 -4.798
## Property_TypePrivate room in bed and breakfast -257.4717 55.3228 -4.654
## Property_TypePrivate room in bungalow -241.7481 45.8062 -5.278
## Property_TypePrivate room in condominium (condo) -205.8855 41.1581 -5.002
## Property_TypePrivate room in guest suite -227.2684 82.9654 -2.739
## Property_TypePrivate room in hostel -232.9576 55.0536 -4.231
## Property_TypePrivate room in loft -208.0485 51.7502 -4.020
## Property_TypePrivate room in rental unit -206.8144 40.4686 -5.110
## Property_TypePrivate room in residential home -199.0148 41.7347 -4.769
## Property_TypePrivate room in serviced apartment -264.9364 45.1590 -5.867
## Property_TypePrivate room in townhouse -218.4113 44.5380 -4.904
## Property_TypePrivate room in villa -234.0597 52.7938 -4.433
## Property_TypeRoom in aparthotel -241.6445 82.4276 -2.932
## Property_TypeRoom in boutique hotel -284.8855 41.1833 -6.918
## Property_TypeRoom in hotel -267.9793 41.7109 -6.425
## Property_TypeShared room -186.7206 49.1121 -3.802
## Property_TypeShared room in bed and breakfast -200.6376 44.9283 -4.466
## Property_TypeShared room in boutique hotel -267.6705 56.5970 -4.729
## Property_TypeShared room in hostel -246.8231 42.8733 -5.757
## Property_TypeShared room in rental unit -176.6594 53.0975 -3.327
## Property_TypeShared room in residential home -173.6442 64.7973 -2.680
## Property_TypeTent -545.9528 82.9690 -6.580
## Property_TypeTiny house -387.0692 81.1098 -4.772
## ln_Capacity_x_Private_ind -77.3574 12.2706 -6.304
## Capacity 6.2504 1.4168 4.412
## Amenities_Wifi -136.0849 22.9387 -5.933
## Amenities_Shampoo -25.2203 5.6615 -4.455
## Amenities_AC -37.5466 17.6964 -2.122
## no_of_am 0.9322 0.2995 3.112
## Amenities_Kitchen -21.4168 7.7589 -2.760
## hv_email 29.3148 9.7303 3.013
## Pr(>|t|)
## (Intercept) 3.21e-16 ***
## ln_Price < 2e-16 ***
## Property_TypeCampsite < 2e-16 ***
## Property_TypeEntire condominium (condo) < 2e-16 ***
## Property_TypeEntire guest suite 1.07e-07 ***
## Property_TypeEntire guesthouse 1.22e-05 ***
## Property_TypeEntire loft 9.16e-07 ***
## Property_TypeEntire place 1.58e-11 ***
## Property_TypeEntire rental unit < 2e-16 ***
## Property_TypeEntire residential home < 2e-16 ***
## Property_TypeEntire serviced apartment < 2e-16 ***
## Property_TypeEntire townhouse 1.09e-10 ***
## Property_TypePrivate room 1.83e-06 ***
## Property_TypePrivate room in bed and breakfast 3.67e-06 ***
## Property_TypePrivate room in bungalow 1.59e-07 ***
## Property_TypePrivate room in condominium (condo) 6.63e-07 ***
## Property_TypePrivate room in guest suite 0.006261 **
## Property_TypePrivate room in hostel 2.52e-05 ***
## Property_TypePrivate room in loft 6.23e-05 ***
## Property_TypePrivate room in rental unit 3.81e-07 ***
## Property_TypePrivate room in residential home 2.12e-06 ***
## Property_TypePrivate room in serviced apartment 5.94e-09 ***
## Property_TypePrivate room in townhouse 1.09e-06 ***
## Property_TypePrivate room in villa 1.02e-05 ***
## Property_TypeRoom in aparthotel 0.003445 **
## Property_TypeRoom in boutique hotel 7.96e-12 ***
## Property_TypeRoom in hotel 2.00e-10 ***
## Property_TypeShared room 0.000152 ***
## Property_TypeShared room in bed and breakfast 8.84e-06 ***
## Property_TypeShared room in boutique hotel 2.56e-06 ***
## Property_TypeShared room in hostel 1.12e-08 ***
## Property_TypeShared room in rental unit 0.000908 ***
## Property_TypeShared room in residential home 0.007481 **
## Property_TypeTent 7.39e-11 ***
## Property_TypeTiny house 2.08e-06 ***
## ln_Capacity_x_Private_ind 4.25e-10 ***
## Capacity 1.13e-05 ***
## Amenities_Wifi 4.04e-09 ***
## Amenities_Shampoo 9.30e-06 ***
## Amenities_AC 0.034095 *
## no_of_am 0.001908 **
## Amenities_Kitchen 0.005875 **
## hv_email 0.002651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.22 on 1055 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7952
## F-statistic: 102.4 on 42 and 1055 DF, p-value: < 2.2e-16
Comments from Kevin: I calculated the MAE value, but this number is probably more useful as a comparison between models.
predict_sin <- predict(final_model_sin,newdata=testing_sin)
MAE_sin <- mean(abs(predict_sin - testing_sin$Price))
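For context, the same test-set MAE can be computed for a naive baseline that always predicts the training-set mean price (a sketch, assuming the training_sin/testing_sin split from above):

```r
# Baseline: predict the mean training price for every test listing,
# then compare its MAE against the fitted model's MAE.
baseline_pred <- mean(training_sin$Price)
MAE_baseline  <- mean(abs(baseline_pred - testing_sin$Price))
c(model = MAE_sin, baseline = MAE_baseline)
```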
Comments from Kevin: There seem to be outlying points and an issue of multicollinearity. We could either try to resolve these or, in the interest of time, state that this is something to look into subsequently.
cook <- cooks.distance(final_model_sin)
plot(cook, type = "h", lwd = 3, ylab = "Cook's Distance")
vif(final_model_sin)
## GVIF Df GVIF^(1/(2*Df))
## ln_Price 2.920011 1 1.708804
## Property_Type 125.889002 33 1.076014
## ln_Capacity_x_Private_ind 10.885166 1 3.299268
## Capacity 2.055544 1 1.433717
## Amenities_Wifi 1.492684 1 1.221754
## Amenities_Shampoo 1.489170 1 1.220316
## Amenities_AC 1.294541 1 1.137779
## no_of_am 1.562050 1 1.249820
## Amenities_Kitchen 1.895419 1 1.376742
## hv_email 1.110136 1 1.053630
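A possible follow-up, using common rule-of-thumb thresholds (and assuming the cook vector and final_model_sin from above): flag observations with Cook's distance above 4/n, and terms whose GVIF^(1/(2*Df)) exceeds sqrt(5), roughly 2.24.

```r
# Influential observations by the 4/n rule of thumb.
n <- nobs(final_model_sin)
influential <- which(cook > 4 / n)
length(influential)

# Terms whose generalised VIF suggests collinearity; car::vif() returns
# a matrix here because Property_Type is a multi-df factor.
gvif <- car::vif(final_model_sin)
gvif[gvif[, "GVIF^(1/(2*Df))"] > sqrt(5), , drop = FALSE]
```

From the output above, only ln_Capacity_x_Private_ind (3.30) would clear that threshold.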